Introduction to R, Text Mining, and Sentiment Analysis
2025-09-23
I’m the Data Equity and Innovation Supervisor for Paid Leave Oregon, where I lead a team of data analysts in transforming complex data into actionable insights.
Outside of work, I explore data through various personal projects that incorporate analytics, visualization, and storytelling. Check out my blogs, dashboards, and talks to see more.
Introduction to R & R Studio
Taylor Data
Text Mining
Sentiment Analysis
R is a statistical programming language that’s incredibly powerful for working with data.
Unlike Python, which is built around objects, R is built around functions. For our purposes, that means we’ll call functions (think \(F(x)=Y\)) to transform our data rather than attaching properties or methods to objects.
R is open source and freely available.
R has an extensive and coherent set of tools for statistical analysis.
R has an extensive and highly flexible graphics facility capable of producing publication-quality figures.
R has an extensive support network with numerous online and freely available documents.
R has an expanding set of freely available ‘packages’ to extend R’s capabilities.
R Studio is an Integrated Development Environment (IDE), similar to VS Code or Sublime.
R Studio provides a more user-friendly interface, incorporating the R console, a script editor, and other useful functionality (like R Markdown and GitHub integration). More information is available on the Posit website.
Most of the work we’ll do relies on packages, which are basically toolkits. To use one, you will first need to install it with the base R function install.packages():
After installing a package, you can load it into your current session with library():
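For example, installing and loading the dplyr package used later in this talk looks like this (a minimal sketch; install.packages() only needs to run once per machine, while library() must run in every new session):

```r
# Download and install the package from CRAN (one time only)
install.packages("dplyr")

# Load the package into the current session
library(dplyr)
```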
When in doubt about what a function does, or what is in a package, you can type ?function_name or ?package_name in the R Studio console to open the description page in the Help tab.
CRAN is the Comprehensive R Archive Network. It’s where R packages are stored, tested, and shared with the community. If you install a package in R, you’re usually pulling it from CRAN.
Create and load R projects
R Studio Landscape (Console, help, viewer, environment, files, packages)
Global Options (appearance & layout)
.R files, Quarto files, render code, comments
R Syntax (assignment <-, concatenate c(), extract values [], extract fields $, sequence :, loops)
Plots & ggplots
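A minimal sketch of the base R syntax listed above:

```r
# Assignment with <-
x <- 5

# Combine values into a vector with c()
scores <- c(10, 20, 30)

# Extract values by position with [ ]
scores[2]          # second element

# Build an integer sequence with :
years <- 2006:2010

# Extract a named field with $
album <- list(name = "Fearless", year = 2008)
album$year

# A simple loop
for (y in years) print(y)

# A quick base plot
plot(scores, type = "b")
```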
What is the difference between R and R Studio?
How do you find documentation on a package or function?
How do you find the version of R you are currently using?
What is the assignment operator?
The Taylor package is a comprehensive resource for data on Taylor Swift songs. Data comes from ‘Genius’ (lyrics) and ‘Spotify’ (song characteristics).
Useful links: taylor, taylor repo
There are three main data sets:
taylor_album_songs: lyrics and audio features from the Spotify API for all songs on Taylor’s official studio albums.
taylor_all_songs: Taylor’s entire discography.
taylor_albums: a summary of Taylor’s album release history.
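A minimal sketch of peeking at these tables (assumes the taylor package has been installed):

```r
library(taylor)

# Album-level summary table
taylor_albums

# First rows of the full discography
head(taylor_all_songs)
```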
# A tibble: 6 × 29
album_name ep album_release track_number track_name artist featuring
<chr> <lgl> <date> <int> <chr> <chr> <chr>
1 Taylor Swift FALSE 2006-10-24 1 Tim McGraw Taylo… <NA>
2 Taylor Swift FALSE 2006-10-24 2 Picture To Burn Taylo… <NA>
3 Taylor Swift FALSE 2006-10-24 3 Teardrops On M… Taylo… <NA>
4 Taylor Swift FALSE 2006-10-24 4 A Place In Thi… Taylo… <NA>
5 Taylor Swift FALSE 2006-10-24 5 Cold As You Taylo… <NA>
6 Taylor Swift FALSE 2006-10-24 6 The Outside Taylo… <NA>
# ℹ 22 more variables: bonus_track <lgl>, promotional_release <date>,
# single_release <date>, track_release <date>, danceability <dbl>,
# energy <dbl>, key <int>, loudness <dbl>, mode <int>, speechiness <dbl>,
# acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
# tempo <dbl>, time_signature <int>, duration_ms <int>, explicit <lgl>,
# key_name <chr>, mode_name <chr>, key_mode <chr>, lyrics <list>
This package creates a data table with sorting and pagination. The default table is an HTML widget that can be used in RMD and Shiny applications, or viewed from an R console.
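A minimal sketch using the built-in mtcars data (the talk uses the Taylor data instead):

```r
library(DT)

# An interactive HTML table with sorting and pagination
datatable(mtcars, options = list(pageLength = 5))
```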
Let’s look at one song:
A package for working with data frames that makes it easy to filter, sort, group, and summarize data using simple, readable functions.
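A minimal dplyr sketch on the built-in mtcars data:

```r
library(dplyr)

# Filter, sort, and summarize in one readable pipeline
mtcars |>
  filter(cyl == 4) |>
  arrange(desc(mpg)) |>
  summarise(avg_mpg = mean(mpg), n = n())
```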
A package that helps turn text (like lyrics or survey responses) into tidy data frames, so you can analyze words, sentiments, and topics with the same tools you use for numbers.
A package for reshaping data, used to make messy data “tidy” by separating, combining, or pivoting columns so each row is an observation and each column is a variable.
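A minimal tidyr sketch, pivoting a small “messy” table so each row is one observation (the column names here are illustrative):

```r
library(tidyr)

# "Messy" wide data: one column per year
wide <- data.frame(song = c("A", "B"),
                   plays_2022 = c(10, 20),
                   plays_2023 = c(30, 40))

# Tidy it: one row per song-year observation
long <- pivot_longer(wide,
                     cols = starts_with("plays_"),
                     names_to = "year",
                     names_prefix = "plays_",
                     values_to = "plays")
```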
A built-in dataset of very common words like the, and, of that usually don’t add much meaning. We remove these so the analysis focuses on more meaningful words.
A custom list of filler or vocalization words (like ooh, ah, la) that appear in lyrics but don’t carry much meaning. We can filter these out so they don’t distract from the main analysis.
A dataset of words like not, no, never, without that flip the meaning of the words around them. This is useful in sentiment analysis because “not happy” is very different from “happy.”
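A minimal sketch of these word lists (stop_words ships with tidytext; the song-word and negation lists below are illustrative stand-ins for the custom ones used in the talk):

```r
library(dplyr)
library(tidytext)

# Built-in stop word lexicon
head(stop_words)

# Custom vocalization words (illustrative)
song_words <- tibble(word = c("ooh", "ah", "la", "mmm", "whoa"))

# Negation words that flip nearby sentiment (illustrative)
negation_words <- tibble(word = c("not", "no", "never", "without"))
```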
What is the importance of stop words?
Why are negations important to consider when analyzing sentiment?
What does the doc column represent in the dataset we created from Taylor’s lyrics?
Why might we want to keep each line of a song instead of collapsing the whole song into one string?
These definitions might feel elementary at first, but that’s the point. The more clearly you understand these simple ideas, the easier it will be to make sense of the more complex modeling steps later.
Word: a single word, and smallest unit of analysis.
Text: the written content inside a document (the lyrics of a single song).
Document: a unit of text (a single song).
Corpus: the full collection of texts (all song lyrics).
Vocabulary: the unique set of words across all documents in a set.
N-grams: a sequence of n words (1 = unigram, 2 = bigram, 3 = trigram), which we can use to look for themes in the corpus.
Stop Words: Stop words are common words (like the, and, is, in, at, on) that usually don’t add much meaning for text analysis.
Sentiment: Sentiment refers to the emotional tone or attitude expressed in text.
This splits a column into tokens, flattening the table into one-token-per-row.
Let’s start by tokenizing our data set, keeping stop words and song words.
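A minimal sketch of unnest_tokens() on a toy stand-in for the lyrics data frame:

```r
library(dplyr)
library(tidytext)

# Toy stand-in: one row per lyric line
Data <- tibble(doc = "tim_mcgraw",
               album = "Taylor Swift",
               lyrics = "But when you think Tim McGraw")

# One token (word) per row; text is lowercased by default
tokens <- Data |>
  unnest_tokens(word, lyrics)

tokens$word
# "but" "when" "you" "think" "tim" "mcgraw"
```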
Notice that there are 132,140 total words in our corpus.
When we count the words we see that there are only 5,146 words in this vocabulary, with the most common words being you, I, the, and and.
To visualize the most common words in our vocabulary we are going to use the ggplot2 package.
Then we will use the following code to visualize the most used words in our vocabulary.
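A minimal sketch of that bar chart, using a small illustrative word-count table in place of the real counts (in the talk these come from count()):

```r
library(dplyr)
library(ggplot2)

# Illustrative counts, not the real corpus numbers
word_counts <- tibble(word = c("you", "i", "the", "and", "love"),
                      n = c(50, 45, 40, 38, 12))

ggplot(word_counts, aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Count", y = NULL, title = "Most common words")
```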
Since we were not able to derive much insight from that initial analysis, let’s try again, this time removing stop words and our custom song words using an anti_join().
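A minimal sketch of the anti_join() step (the song_words table here is an illustrative stand-in for the custom list):

```r
library(dplyr)
library(tidytext)

song_words <- tibble(word = c("ooh", "ah", "la"))

tokens <- tibble(word = c("you", "love", "ooh", "the"))

# anti_join() keeps only rows with no match in the second table
clean <- tokens |>
  anti_join(stop_words, by = "word") |>
  anti_join(song_words, by = "word")

clean$word
# "love"
```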
bigram <- Data |>
  select(doc, album, lyrics) |>
  mutate(lyrics = tolower(lyrics)) |>
  unnest_tokens(
    bigram,
    lyrics,
    token = "ngrams",
    n = 2
  ) |>
  separate(
    col = bigram,
    sep = " ",
    into = c("w1", "w2"),
    remove = FALSE
  ) |>
  filter(!w1 %in% stop_words$word) |>
  filter(!w2 %in% stop_words$word) |>
  filter(!w1 %in% song_words$word) |>
  filter(!w2 %in% song_words$word) |>
  filter(!is.na(bigram))

What’s the difference between a document, text, and the corpus in this project?
Why were stop words and song words removed?
Your facet plot is not sorted within each album. What two helpers fix it?
Why does slice_max(n, n = 10) differ from filter(n > 10) after count()?
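A minimal sketch contrasting the two (the tidytext helpers reorder_within() and scale_y_reordered() handle within-facet sorting):

```r
library(dplyr)

counts <- tibble(word = c("love", "shake", "red"),
                 n = c(100, 12, 5))

slice_max(counts, n, n = 10)  # the top 10 rows by n (here, all 3 rows)
filter(counts, n > 10)        # only rows with n above 10 (drops "red")
```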
The tidytext package provides access to several sentiment lexicons.
Three general-purpose lexicons are included with tidytext. We will be using two.
bing: the words are assigned scores for positive/negative sentiment.
# A tibble: 6,786 × 2
word sentiment
<chr> <chr>
1 2-faces negative
2 abnormal negative
3 abolish negative
4 abominable negative
5 abominably negative
6 abominate negative
7 abomination negative
8 abort negative
9 aborted negative
10 aborts negative
# ℹ 6,776 more rows
afinn: assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
# A tibble: 2,477 × 2
word value
<chr> <dbl>
1 abandon -2
2 abandoned -2
3 abandons -2
4 abducted -2
5 abduction -2
6 abductions -2
7 abhor -3
8 abhorred -3
9 abhorrent -3
10 abhors -3
# ℹ 2,467 more rows
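A minimal sketch of scoring tokens against a lexicon with an inner_join() (bing ships with tidytext; the afinn lexicon may prompt a one-time download via the textdata package):

```r
library(dplyr)
library(tidytext)

tokens <- tibble(word = c("happy", "abandoned", "guitar"))

# Keep only words found in the lexicon; unmatched words drop out
tokens |>
  inner_join(get_sentiments("bing"), by = "word")
```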
These dictionaries were constructed via either crowdsourcing (using, for example, Amazon Mechanical Turk) or by the labor of one of the authors, and were validated using some combination of crowdsourcing again, restaurant or movie reviews, or Twitter data. Given this information, we may hesitate to apply these sentiment lexicons to styles of text dramatically different from what they were validated on.
Goodbye, and thanks!